Surface the self-approval prohibition at the top of verifier.json #148
Merged
When a PR weakens the release policy or touches a trust root, a coding agent must not silently self-approve a change to its own gate. That prohibition was only present inside a fix_task instruction (PR #146); promote it to the two fields an agent reads first.

- Add _self_approval_note(): the explicit "a coding agent cannot self-approve that change - a human must review it" message for policy_weakened (taking precedence) and trust_root_touched.
- verifier.json headline leads with the note when present.
- human_review.why leads with the note, and a self-approval note forces human_review.required=True regardless of the verdict path.

Full suite: 2346 passed, 4 skipped. No schema change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
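The precedence rule in this commit message can be sketched as follows. This is a minimal illustration and not the PR's actual code: the dict shape of capability_review, the function name without the leading underscore, and the wording around the quoted sentence are all assumptions made for the example.

```python
from typing import Mapping, Optional

# Hypothetical wording: the PR quotes only the core sentence
# ("a coding agent cannot self-approve that change - a human must review it").
_NOTE_TEMPLATE = (
    "This PR {what}: a coding agent cannot self-approve that change - "
    "a human must review it."
)


def self_approval_note(capability_review: Mapping[str, bool]) -> Optional[str]:
    """Return the self-approval note, or None for a clean review.

    Assumes capability_review is a mapping of boolean flags.
    """
    # policy_weakened is checked first, so it wins when both flags are set.
    if capability_review.get("policy_weakened"):
        return _NOTE_TEMPLATE.format(what="weakens the release policy")
    if capability_review.get("trust_root_touched"):
        return _NOTE_TEMPLATE.format(what="touches a trust root")
    # Clean reviews get no note.
    return None
```

Returning None rather than an empty string lets callers distinguish "no note" with a simple identity check.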
…ew fix)

Addresses review of #148: a self-approval note forced human_review.required=True, but can_merge_without_human and first_next_action still keyed only off merge_verdict, so the defensive (mergeable + note) path could emit "human review required" and "safe to merge" at once.

- _can_merge_without_human returns False whenever a self-approval note exists.
- _first_next_action routes to a human review (never the "safe to merge" action) when a self-approval note is present, including the fix_task-None defensive case.
- Both thread capability_review from _build_verifier.

Clean mergeable behavior (no note) is unchanged; covered by a regression test. Full suite: 2349 passed, 4 skipped. No schema change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
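The fix described in this commit can be sketched roughly as below. The verdict value "mergeable", the action strings, the signatures, and the behavior when no note exists are assumptions chosen for illustration; only the rule that a note blocks the agent-only path comes from the commit message.

```python
from typing import Optional


def can_merge_without_human(merge_verdict: str, note: Optional[str]) -> bool:
    # A self-approval note always blocks agent-only merging,
    # even on the defensive (mergeable + note) path.
    if note is not None:
        return False
    return merge_verdict == "mergeable"  # assumed verdict value


def first_next_action(
    merge_verdict: str, note: Optional[str], fix_task: Optional[str]
) -> str:
    # With a note present, never emit the "safe to merge" action,
    # including when fix_task is None.
    if note is not None:
        return "request_human_review"  # hypothetical action name
    if merge_verdict == "mergeable":
        return "safe_to_merge"  # hypothetical action name
    # Assumed fallback for non-mergeable verdicts without a note.
    return fix_task or "request_human_review"
```

Checking the note before the verdict in both functions keeps the two fields from contradicting each other.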
What

Promotes the self-approval prohibition to the top of verifier.json. When a PR edits the rules that evaluate it (a weakened release policy or a touched trust root), a coding agent must never silently self-approve (reward hacking). #146 carried that message inside a fix_task instruction; this surfaces it in the two fields an agent reads first.

Why
The verifier already detects policy_weakened / trust_root_touched and routes them to a human, but the agent-facing headline and human_review.why still showed the generic scan headline. An agent skimming the top of the artifact wouldn't see the most important fact: you cannot clear your own gate.

Changes
- _self_approval_note(): the explicit "a coding agent cannot self-approve that change - a human must review it" message. policy_weakened takes precedence over trust_root_touched; clean reviews get no note.
- headline leads with the note when present (ahead of agent_summary.headline).
- human_review.why leads with the note, and a note forces human_review.required = True regardless of the verdict path: defense in depth so a weakened policy can never be marked agent-clearable.
- Tests (tests/test_self_approval_signal.py).

Verification

Full suite 2346 passed, 4 skipped, 0 failed; generate_schemas.py --check clean (no schema change: additive logic over the existing capability_review flags); ruff clean.

🤖 Generated with Claude Code
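To make the headline and human_review changes listed under Changes concrete, here is a small sketch of how the two top fields might be assembled once a note is available. The function name, parameter names, and the way the note is joined to the existing text are hypothetical; only the field names headline, human_review.why, and human_review.required come from the description.

```python
from typing import Any, Dict, Optional


def build_top_fields(
    scan_headline: str,
    base_why: str,
    base_required: bool,
    note: Optional[str],
) -> Dict[str, Any]:
    """Assemble the two fields an agent reads first (illustrative only)."""
    if note is None:
        # Clean reviews are left unchanged.
        return {
            "headline": scan_headline,
            "human_review": {"required": base_required, "why": base_why},
        }
    return {
        # The note leads, ahead of the generic scan headline.
        "headline": f"{note} {scan_headline}",
        "human_review": {
            # Forced True regardless of the verdict path.
            "required": True,
            "why": f"{note} {base_why}",
        },
    }
```

For example, passing base_required=False together with a note still yields human_review.required set to True, which is the defense-in-depth behavior the description calls out.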